SOC128D - Analytics for a Changing Climate: Introduction to Social Data Science

Final Project Report

Study on the Effects of High Temperature on Mental Health in California

1. Literature Review

To gain a preliminary understanding of the relationship between high temperature and mental health, we conducted a literature review. Firstly, Thompson et al. (2018) conducted a systematic review to illuminate the association between high temperatures and mental health. Through their review, we noted the different categories of mental health that others have explored, namely (A) Suicide (B) Bipolar Disorder (C) Mania, Depression and Heat (D) Organic mental health outcomes, including dementia and (E) Alcohol and substance misuse. Given the limited timeframe of this project, it was not possible to cover all 6 areas and hence we decided to focus on the area of suicide, especially given the relatively larger number of studies that discuss about this aspect as compared to the rest. Notably, the authors underscored the two different views on the relationship between high temperature and mental health, one of which believes there’s a causal relationship between them and the other does not believe that it is the case.

On a macro-level, a study in Australia done between 1970 and 2007 found that males, aged between 30-49 and lived in rural areas, experienced an increase in the relative risk of suicide (Hanigan et al., 2012) (https://doi.org/10.1073/pnas.1112965109). The increase coincided with an increase in drought index between the first and third quartile of every year in the same period, suggesting that there might be a causal relationship between abnormal temperature patterns (which is closely related to drought index) and mental health. Similarly, Deisenhammer et al. (2003) (https://doi.org/10.1046/j.0001-690x.2003.00219) found that high ambient temperatures were closely associated with a range of mental health effects, with the most compelling evidence found for increased suicide risk. On a micro-level, Lohmus (2018) (https://doi.org/10.3390/ijerph15071515) suggested that high temperatures may lead to poor sleep quality and sleep deprivation, both which may contribute to the development of mental health disorders and persistence of existing mental health disorders.

On the other hand, a study analysing experiments, which purported close links between rising temperatures and suicide risk, found that many of these studies did not consider long term climate patterns or lacked sufficient sample size for the results to be generalisable (Dixon et al., 2007) (https://doi.org/10.1007/s00484-006-0081-4). Researchers of the same study also employed simple linear regression to highlight that links between monthly suicide rates and mean suicide rates for certain American counties were weak. In the same vein, Thompson et al. (2018) (https://doi.org/10.1016/j.puhe.2018.06.008) concluded that there was limited evidence for increased heat-related morbidity and deaths amongst those with mental health disorders and acknowledged gaps existed in knowledge about the dynamics between heat and mental health-related deaths.

3. Correlations

3.1 Correlation Graphs

To understand the relationship/association between temperature and mental health, correlation graphs were used to summarize the degree and direction of correlation between the different variables. This was first done using statewide yearly data.

combined_data = read.csv("proj data/[Combined] Calfornia County Suicide and Temperature Data.csv")
correlation_matrix_county <- cor(combined_data)
ggcorrplot(correlation_matrix_county, type = "full", outline.color = "white",
           colors = c("#FC4E07", "white", "#00AFBB"),
           lab = TRUE,
           lab_size = 1.5,
           tl.cex = 6)

Figure 33: Correlation between Suicide & Temperature (Statewide Data)
ggcorrplot(correlation_matrix_county, method = "circle",
           colors = c("#FC4E07", "white", "#00AFBB"),
           tl.cex = 6)

Figure 34: Correlation between Suicide & Temperature (Statewide Data)

From the above correlation graphs, we see that the correlation between temperature measures and suicide is positive but weak, while the correlation between precipitation and suicide is negative but weak.

However, if we look at the changes in these variables, we see that there is a negative but weak correlation between change in temperature measures and change in suicide rates. The correlation between change in precipitation and change in suicide rates is negative but very weak.

Similarly, correlation graphs were plotted using county-level yearly data, to observe the correlation at county-level.

combined_data_overall = read.csv("proj data/[Combined] California Overall Suicide and Temperature.csv")
#View(combined_data_overall)
correlation_matrix_overall <- cor(combined_data_overall)
#View(correlation_matrix)
#str(combined_data_overall)
ggcorrplot(correlation_matrix_overall, type = "full", outline.color = "white",
           colors = c("#FC4E07", "white", "#00AFBB"), 
           lab = TRUE,
           lab_size = 1.5,
           tl.cex = 6)

Figure 35: Correlation between Suicide & Temperature (County-Level Data)
ggcorrplot(correlation_matrix_overall, method = "circle",
           colors = c("#FC4E07", "white", "#00AFBB"),
           tl.cex = 6)

Figure 36: Correlation between Suicide & Temperature (County-Level Data)

Unlike the plots that used statewide data, the correlation graphs from the county-level data showed a negative but weak correlation between temperature measures and suicide. The correlation between precipitation and suicide was still seen to be negative but weak. However, such comparison does not take into consideration that different counties have different baseline temperatures

If we look at the changes in this variables, we see that there is a positive but very weak correlation between the change in temperature measures and the change in suicide rates. The correlation between the change in precipitation and the change in suicide rates positive but very weak.

3.2 Scatter Plots (with Regression)

Data wrangling was done to consolidate the different datasets before we could explore the county-level data deeper. This includes joining the datasets, removing unnecessary data points (eg. NA rows) and renaming columns for easier referencing.

df1 <- read.csv("proj data/County_Level_California_Avg,_Mix,_Max_Temp_and_Precipitation.csv")
df2 <- read.csv("proj data/[County Level] California County Suicide Rate.csv")
df3 <- read.csv("proj data/County_Level_Hospitalizations_for_Mental_Health_Issues,_Age_5_19.csv")

#full <- read.csv("proj data/[Combined] Calfornia County Suicide and Temperature Data.csv")
full_county <- list(df2,df3,df1) %>%
  reduce(left_join, by = c("YEAR", "COUNTY"))
#head(full_county)
#full_county <- subset(full_county, select = -c(X, X.1))  

colnames(full_county) <- colnames(full_county) %>% 
  str_replace_all("\\.", "_") %>% 
  tolower()
colnames(full_county) = gsub("_$", "", colnames(full_county))
colnames(full_county) = gsub("__", "_", colnames(full_county))

colnames(full_county)[colnames(full_county) == "rate_per_1000_people"] = "hospitalizations_rate_per_1000_people"
colnames(full_county)[colnames(full_county) == "change_in_rate"] = "change_in_h_rate"

full_county <- na.omit(full_county)
#head(full_county)

A preliminary plot was generated to understand the general relationship (degree and direction) between the variables of interest. We looked at the average temperature, precipitation, suicide rate (per 100,000) and hospitalization rate (per 1000) in one plot, and the change in these variables in another plot.

preliminary_plot1 <- subset(full_county, select = c(5,7,10,15))
#head(preliminary_plot1)
plot(preliminary_plot1, main = "Relations between Variables")

Figure 37: Preliminary Plots showing Relationship between Variables

We were interested to understand more about the relationship between the independent variables (temperature and precipitation) and the dependent variables (mental health data). From the first preliminary plot, the trends for certain relationships (eg. precipitation and suicide rate) were more distinct, while others (eg. temperature vs hospitalization rate) were inconclusive.

preliminary_plot2 <- subset(full_county, select = c(6,8,16,19))
#head(preliminary_plot2)
plot(preliminary_plot2, main = "Relations between Change in Variables")

Figure 38: Preliminary Plots showing Relationship between Change in Variables

Compared to the first preliminary plot, the second preliminary plot showed slightly clearer trends. For example, a seemingly linear relationship can be seen between the change in suicide rate and the change in average temperature.

Next, scatter plots were done for each pair of variables, along with a regression line to better visualize the relationship between two variables. We also experimented with the visualizations by using colours for different counties and also by using the facet_wrap() function.

3.2.1 Average Temperature vs Change in Suicide Rate

ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_suicide_rate, color = county)) +
  geom_point(size=1.2) +
  geom_line(linewidth=0.7) +
  theme(legend.position = "none") + 
  labs(x = "Change in Average Temp (C)", y = "Change in Suicide Rate (per 100,000 people)", title = "Change in Suicide Rate against Change in Average Temp (C)")

Figure 39: Change in Suicide Rate (per 100,000) against Change in Average Temperature (C)

It is difficult to distinguish clear relationship from the above plot, due to the overlapping points and large number of counties. Hence, it may be better to visualize using the facet_wrap() function, and be using regression lines to observe overall trend for each county.

ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_suicide_rate)) +
  geom_point() +
  geom_smooth(aes(group=county), method = "lm", se = FALSE, color = "red", linewidth = 1, alpha = 0.6) +
  facet_wrap(~county) +
  theme(legend.position = "none") + 
  labs(x = "Change in Average Temp (C)", y = "Change in Suicide Rate (per 100,000 people)", title = "Change in Suicide Rate against Change in Average Temp (C)")
## `geom_smooth()` using formula = 'y ~ x'

Figure 40: Change in Suicide Rate (per 100,000) against Change in Average Temperature (C) (by County) (Linear Regression)

From the straight-line linear model, there is a lack of correlation seen in most counties. In some counties (e.g. Amador, Fresno), there seems to be slight level of regression, however the direction seems to vary and the magnitude of change is generally small.

We could also use a linear model analysis to statistically verify the suitability of using a linear model to fit the data points.

suicide_lm <- lm(change_in_suicide_rate~change_in_avg_temp, full_county)
summary(suicide_lm)
## 
## Call:
## lm(formula = change_in_suicide_rate ~ change_in_avg_temp, data = full_county)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.453  -2.133   0.023   1.891  91.568 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)          0.1815     0.4350   0.417    0.677
## change_in_avg_temp   0.4550     0.6597   0.690    0.491
## 
## Residual standard error: 9.728 on 498 degrees of freedom
## Multiple R-squared:  0.0009543,  Adjusted R-squared:  -0.001052 
## F-statistic: 0.4757 on 1 and 498 DF,  p-value: 0.4907

When we used the county-level dataset as a whole, we saw that there is a very slight positive gradient (0.08803) in the relationship between change in temperature and change in suicide rates. From the p-value (0.0338), we conclude that at 95% significance level, this gradient value is statistically significant. However, the low R^2 value of 0.009015 suggests that change in temperature alone does not explain much of the suicide rates in California.

We could use smooth regression to better account for variations in trends (eg. is there a range of temperatures where effect on suicide rates is larger).

ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_suicide_rate)) +
  geom_point() +
  geom_smooth(aes(group=county), method = "loess", se = FALSE, color = "red", linewidth = 1, alpha = 0.6) +
  facet_wrap(~county) +
  theme(legend.position = "none") + 
  labs(x = "Change in Average Temp (C)", y = "Change in Suicide Rate (per 100,000 people)", title = "Change in Suicide Rate against Change in Average Temp (C)")
## `geom_smooth()` using formula = 'y ~ x'

Figure 41: Change in Suicide Rate (per 100,000) against Change in Average Temperature (C) (by County) (Smooth Regression)

We observed that the use of smooth regression allows us to visualize the relationship more accurately. It is also interesting to note that the relationship is not necessarily strictly increasing or decreasing for most counties.

3.2.2 Average Temperature vs Change in Hospitalization Rate

ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_h_rate, group = county, color = as.factor(county))) +
  geom_point(size=1) +
  geom_line() +
  labs(y = "Change in Hospitalization Rate (per 1000 people)", title = "Change in Hospitalization Rate against Change in Average Temp (C)") +
  scale_x_continuous(name = "Change in Average Temp (C)") +
  theme(legend.position = "none") 

Figure 42: Change in Hospitalization Rate (per 1000) against Change in Average Temperature (C)

Similar to the data for change in suicide rates, it is difficult to discern the relationship from the above plot, due to the overlapping points and large number of counties. Hence, it may be better to visualize using the facet_wrap() function, and be using regression lines to observe overall trend for each county.

ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_h_rate)) +
  geom_point(size = 1) +
  geom_smooth(aes(group=county), method = "lm", se = FALSE, color = "red", linewidth = 1, alpha = 0.6) +
  labs(y = "Change in Hospitalization Rate (per 1000 people)", title = "Change in Hospitalization Rate against Change in Average Temp (C)") +
  facet_wrap(~county) + # too many counties to see clearly
  scale_x_continuous(name = "Change in Average Temp (C)") +
  theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'

Figure 43: Change in Suicide Rate (per 1000) against Change in Average Temperature (C) (by County) (Linear Regression)

The linear model shows a lack of correlation, as seen in most counties. In some counties (e.g. Alpine, Mono), there seems to be slight level of regression, however the direction seems to vary and the magnitude is generally small. Likewise, we also ran a linear model analysis for the change in hospitalization rates data.*

hospitalization_lm <- lm(change_in_h_rate~change_in_avg_temp, full_county)
summary(hospitalization_lm)
## 
## Call:
## lm(formula = change_in_h_rate ~ change_in_avg_temp, data = full_county)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53961 -0.37606 -0.03169  0.32826  2.68292 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.18803    0.02728   6.894 1.65e-11 ***
## change_in_avg_temp  0.08803    0.04136   2.128   0.0338 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6099 on 498 degrees of freedom
## Multiple R-squared:  0.009015,   Adjusted R-squared:  0.007025 
## F-statistic:  4.53 on 1 and 498 DF,  p-value: 0.03379

We saw a slight positive gradient (0.455) in the relationship between change in temperature and change in hospitalization rates. From the p-value (0.4907), we conclude that at 95% significance level, this gradient value is not statistically significant. The low R^2 value of 0.0009543 also suggests that the change in temperature alone does not explain much of the change in hospitalization rates in California.

ggplot(full_county, aes(x=change_in_avg_temp, y=change_in_h_rate)) +
  geom_point(size = 1) +
  geom_smooth(aes(group=county), method = "loess", se = FALSE, color = "red", linewidth = 1, alpha = 0.6) +
  labs(y = "Change in Hospitalization Rate (per 1000 people)", title = "Change in Hospitalization Rate against Change in Average Temp (C)") +
  facet_wrap(~county) + # too ma y counties to see clearly
  scale_x_continuous(name = "Change in Average Temp (C)") +
  theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'

Figure 44: Change in Hospitalization Rate (per 1000) against Change in Average Temperature (C) (by County) (Smooth Regression)

We observed that using smooth regression allows us to visualize the relationship more accurately. It is also interesting to note that the relationship is not necessarily strictly increasing or decreasing for most counties.

4. Sentiment Analysis

Sentiment analysis of text data can allow us to understand more about how the population views the issue of rising temperatures and can potentially shed light on the mental health of the population as a whole. To gain a better understanding of these areas, we looked into specific words in Google trends, such as “heat” and “insomnia”, terms that can provide us some clues on the potential effects of rising temperatures and whether there is actually increasing interest in the rising temperature.

Increasing searches or interest in keywords may denote increasing concern and worry about the rising temperature. At the same time, increasing searches for key terms like insomnia may suggest an increasing number of people who suffer from poor sleep quality, which could be due to the rising temperatures, as purported by Lohmus (2018). To visualize the effects, we plotted several line graphs of search counts versus time with the data collected.

4.2 Reddit

We also extracted data from Reddit and conducted analysis for a few keywords like “heat”, “hot”, and “warm” over the different lexicons.

4.2.1 Keyword: Heat

#heat_urls <- find_thread_urls(keywords = "heat", subreddit="California", period  = "all")
#heat_comments <- get_thread_content(heat_urls$url)

heat_comments <- read.csv("proj data/heat_comments_reddit.csv")
#head(heat_comments)
tidy_comments <- heat_comments %>%
  unnest_tokens(word, comment) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_comments %<>%
  anti_join(stop_words)
## Joining, by = "word"
#View(tidy_comments)
afinn <- tidy_comments %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = timestamp %/% 604800) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing <- tidy_comments %>% 
  inner_join(get_sentiments("bing")) %>% 
  mutate(method = "Bing et al.") %>%
  count(method, index = timestamp %/% 604800, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
nrc <- tidy_comments %>% 
  inner_join(get_sentiments("nrc") %>%
               # only get positive and negative sentiments
               filter(sentiment %in% c("positive", "negative"))) %>% 
  mutate(method = "NRC") %>%
  count(method, index = timestamp %/% 604800, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
all_three <- bind_rows(afinn,
                       bing, 
                       nrc)

ggplot(all_three, aes(index, sentiment, fill = method)) +
  geom_col(aes(fill = method))+
  facet_wrap(~method, ncol = 1, scales = "free_y")

Figure 51: Sentiment Analysis in Reddit Threads with keyword “Heat” over Time

Sentiment analysis of the keyword “heat” shows a relatively sparse dataset, especially during the earlier years. There is a lack of general trends to be drawn from the sentiment analysis, even though we can see that there are more extreme positive and negative sentiments regarding the search term “heat” in the later years.

nrc_sentiment <- tidy_comments %>%
  inner_join(get_sentiments("nrc"))
## Joining, by = "word"
nrc_sentiment %>%
  select(word, sentiment) %>%
  head()
##     word    sentiment
## 1 author     positive
## 2 author        trust
## 3   tree        anger
## 4   tree anticipation
## 5   tree      disgust
## 6   tree          joy
table(nrc_sentiment$sentiment)
## 
##        anger anticipation      disgust         fear          joy     negative 
##          762          947          437          876          770         1506 
##     positive      sadness     surprise        trust 
##         2185          720          470         1272
ggplot(nrc_sentiment, aes(y = sentiment))+
  geom_bar(aes(fill = sentiment))+
  theme_minimal()+
  labs(title = "NRC Sentiments in California Subreddit Posts - Heat")

Figure 52: NRC Sentiments in California Subreddit Posts - Heat

4.2.2 Keyword: Warm

#warm_urls <- find_thread_urls(keywords = "warm", subreddit="California", period  = "all")
#warm_comments <- get_thread_content(warm_urls$url)

warm_comments <- read.csv("proj data/warm_comments_reddit.csv")
tidy_comments <- warm_comments %>%
  unnest_tokens(word, comment) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_comments %<>%
  anti_join(stop_words)
## Joining, by = "word"
#View(tidy_comments)
afinn <- tidy_comments %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = timestamp %/% 604800) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing <- tidy_comments %>% 
  inner_join(get_sentiments("bing")) %>% 
  mutate(method = "Bing et al.") %>%
  count(method, index = timestamp %/% 604800, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
nrc <- tidy_comments %>% 
  inner_join(get_sentiments("nrc") %>%
               # only get positive and negative sentiments
               filter(sentiment %in% c("positive", "negative"))) %>% 
  mutate(method = "NRC") %>%
  count(method, index = timestamp %/% 604800, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
all_three <- bind_rows(afinn,
                       bing, 
                       nrc)

ggplot(all_three, aes(index, sentiment, fill = method)) +
  geom_col(aes(fill = method))+
  facet_wrap(~method, ncol = 1, scales = "free_y")

Figure 53: Sentiment Analysis in Reddit Threads with keyword “Warm” over Time

Sentiment analysis of the keyword “warm” shows a relatively sparse dataset, especially during the earlier years. Similar to the sentiment analysis for the keyword “hot”, there is a lack of general trends to be drawn from the sentiment analysis, even though we can see that there are more extreme positive and negative sentiments regarding the search term “heat” in the later years, which is also what we observed for the sentiment analysis for the keyword “warm”.

nrc_sentiment <- tidy_comments %>%
  inner_join(get_sentiments("nrc"))
## Joining, by = "word"
nrc_sentiment %>%
  select(word, sentiment) %>%
  head()
##       word sentiment
## 1 juvenile  negative
## 2     love       joy
## 3     love  positive
## 4    sting     anger
## 5    sting      fear
## 6    sting  negative
table(nrc_sentiment$sentiment)
## 
##        anger anticipation      disgust         fear          joy     negative 
##          610          834          487          538          831         1227 
##     positive      sadness     surprise        trust 
##         1969          440          365         1094
ggplot(nrc_sentiment, aes(y = sentiment))+
  geom_bar(aes(fill = sentiment))+
  theme_minimal()+
  labs(title = "NRC Sentiments in California Subreddit Posts - Warm")

Figure 54: NRC Sentiments in California Subreddit Posts - Warm

4.2.3 Keyword: Hot

#hot_urls <- find_thread_urls(keywords = "hot", subreddit="California", period  = "all")
#hot_comments <- get_thread_content(hot_urls$url)

hot_comments <- read.csv("proj data/hot_comments_reddit.csv")
tidy_comments <- hot_comments %>%
  unnest_tokens(word, comment) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_comments %<>%
  anti_join(stop_words)
## Joining, by = "word"
#View(tidy_comments)
afinn <- tidy_comments %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = timestamp %/% 604800) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing <- tidy_comments %>% 
  inner_join(get_sentiments("bing")) %>% 
  mutate(method = "Bing et al.") %>%
  count(method, index = timestamp %/% 604800, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
nrc <- tidy_comments %>% 
  inner_join(get_sentiments("nrc") %>%
               # only get positive and negative sentiments
               filter(sentiment %in% c("positive", "negative"))) %>% 
  mutate(method = "NRC") %>%
  count(method, index = timestamp %/% 604800, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
all_three <- bind_rows(afinn,
                       bing, 
                       nrc)

ggplot(all_three, aes(index, sentiment, fill = method)) +
  geom_col(aes(fill = method))+
  facet_wrap(~method, ncol = 1, scales = "free_y")

Figure 55: Sentiment Analysis in Reddit Threads with keyword “Hot” over Time

Sentiment analysis of the keyword “hot” also shows a relatively sparse dataset, especially during the earlier years, similar to the keywords “heat” and “warm”. However, unlike the earlier two keywords, sentiment analysis yields largely positive sentiments regardless of the lexicon used. This might be due to the many ways that Reddit users could use the word “hot”, that might be associated with positive sentiments, that may not be relevant for this investigation.

nrc_sentiment <- tidy_comments %>%
  inner_join(get_sentiments("nrc"))
## Joining, by = "word"
nrc_sentiment %>%
  select(word, sentiment) %>%
  head()
##       word    sentiment
## 1    beach          joy
## 2    start anticipation
## 3 favorite          joy
## 4 favorite     positive
## 5 favorite        trust
## 6    seals        trust
table(nrc_sentiment$sentiment)
## 
##        anger anticipation      disgust         fear          joy     negative 
##           49           71           26           65           61          111 
##     positive      sadness     surprise        trust 
##          157           44           25           99
ggplot(nrc_sentiment, aes(y = sentiment))+
  geom_bar(aes(fill = sentiment))+
  theme_minimal()+
  labs(title = "NRC Sentiments in California Subreddit Posts - Hot")

Figure 56: NRC Sentiments in California Subreddit Posts - Hot

4.2.4 Keyword: Temperature

#temperature_urls <- find_thread_urls(keywords = "temperature", subreddit="California", period  = "all")
#temperature_comments <- get_thread_content(temperature_urls$url)

temperature_comments <- read.csv("proj data/temperature_comments_reddit.csv")
tidy_comments <- temperature_comments %>%
  unnest_tokens(word, comment) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_comments %<>%
  anti_join(stop_words)
## Joining, by = "word"
#View(tidy_comments)
afinn <- tidy_comments %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = timestamp %/% 604800) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing <- tidy_comments %>% 
  inner_join(get_sentiments("bing")) %>% 
  mutate(method = "Bing et al.") %>%
  count(method, index = timestamp %/% 604800, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
nrc <- tidy_comments %>% 
  inner_join(get_sentiments("nrc") %>%
               # only get positive and negative sentiments
               filter(sentiment %in% c("positive", "negative"))) %>% 
  mutate(method = "NRC") %>%
  count(method, index = timestamp %/% 604800, sentiment) %>% 
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
all_three <- bind_rows(afinn,
                       bing, 
                       nrc)

ggplot(all_three, aes(index, sentiment, fill = method)) +
  geom_col(aes(fill = method))+
  facet_wrap(~method, ncol = 1, scales = "free_y")

Figure 57: Sentiment Analysis in Reddit Threads with keyword “Temperature” over Time

Sentiment analysis of the keyword “temperature” also shows a relatively sparse dataset, especially during the earlier years, similar to the keywords “heat” and “warm”. Similar to the keyword “hot”, sentiment analysis yields largely positive sentiments regardless of the lexicon, which might be due to alternative meanings that extend beyond the scope of our investigations as well. The magnitudes of both positive and negative sentiments seem to taper off through the years here, as opposed to the keyword “hot”.

nrc_sentiment <- tidy_comments %>%
  inner_join(get_sentiments("nrc"))
## Joining, by = "word"
nrc_sentiment %>%
  select(word, sentiment) %>%
  head()
##       word sentiment
## 1 juvenile  negative
## 2     love       joy
## 3     love  positive
## 4    sting     anger
## 5    sting      fear
## 6    sting  negative
table(nrc_sentiment$sentiment)
## 
##        anger anticipation      disgust         fear          joy     negative 
##          844         1031          643          832          760         1669 
##     positive      sadness     surprise        trust 
##         2046          698          517         1100
ggplot(nrc_sentiment, aes(y = sentiment))+
  geom_bar(aes(fill = sentiment))+
  theme_minimal()+
  labs(title = "NRC Sentiments in California Subreddit Posts - Temperature")

Figure 58: NRC Sentiments in California Subreddit Posts - Temperature

Sentiment analysis (with respect to the NRC lexicon) for the four keywords show similar results, with the strongest sentiment being “positive” for all four keywords, with the next strongest sentiment being “negative” sentiment.

4.3 Los Angeles Times (Proquest)

ProQuest TDM was also used to generate visualisations on Geographical Analysis and Sentiment Analysis for the search term “Rising Temperatures”. Our visualisations are based on a dataset generated from a compilation of Los Angeles Times articles from 1886-1922, and 1923-1995, and Los Angeles Sentinel articles from 1934. The dataset has a total of 8901 documents.

Geographic Analysis
Geographic Analysis

Based on the ProQuest Geographical Analysis, most of the news that matched the search term came from California. From this, it is fair to conclude that most of the sentiment analysis regarding “rising temperatures” comes from Californian residents.

Sentiment Analysis
Sentiment Analysis

Based on the ProQuest Sentiment Analysis, we can look at the emotions associated with mental health:
- [Fear] Difficult to discern a clear trend as it seems to be fluctuating from 1920 to 2020
- [Happiness] Difficult to discern a clear trend as it seems to be fluctuating. However, there seems to be a steady increase from 1980 onwards
- [Sadness] Difficult to discern a clear trend as it seems to be fluctuating. However, there seems to be a steady increase from 1980 onwards, similar to happiness.

5. Conclusion & Summary

No. Final summary on the methods used in the investigation:
1. Literature review during the initial investigations highlighted that there were two conflicting views involved in the analysis of the impact of rising temperatures.
2. Visualisations and investigation of certain trends using ggplot did find a general increase in the trends of both temperatures and their associated impacts.
3. Correlation studies proved positive, but weak, correlation between rising temperatures and suicide rates in California.
4. Sentiment analysis have limited usefulness in determining emotions associated with high temperatures with respect to time, but does also provide some insights to sentiments and concerns regarding rising temperatures and clues to how they may link to mental health.

6. Annex

6.1 Spatial Analysis

6.1.1 Spatial Analysis: Suicide Rates vs Average Temperature

Suicide Rates

counties <- counties(state = "CA")
## Retrieving data for the year 2021
#ggplot(counties)+
#  geom_sf()
california_suicide_spatial = read.csv("proj data/[County Level] California County Suicide Rate.csv")

counties %<>%
  left_join(california_suicide_spatial, by = c("NAME" = "COUNTY"))

counties %<>% 
  drop_na("YEAR") #drop any NA values
#View(counties)

ggplot(counties)+
  geom_sf(aes(fill = `SUICIDE.RATE..PER.100.000.`))+
  scale_fill_gradient(high = "Blue", low = "White")+
  facet_wrap(~YEAR, ncol = 5)+ 
  labs(title = "Spatial Plot of California counties representing Suicide Rates over the years", fill = "Suicide Rate (Per 100,000 People)") +
  theme(axis.text = element_text(size = 6))

  #creating different ggplots showcasing suicide rates per 100,000, separated by year
Figure 59: Spatial Plot of California counties representing Suicide Rates over the years

Nothing really remarkable, just that Northern area seems to suffer a higher rate of suicides as compared to the Southern area, as evident by the whiter shade at the South and redder shade at the North from 2010 to 2019.

Average Temperature

counties_1 <- counties(state = "CA")
## Retrieving data for the year 2021
#ggplot(counties_1)+
#  geom_sf()
california_temperature_spatial = read.csv("proj data/County_Level_California_Avg,_Mix,_Max_Temp_and_Precipitation.csv")

counties_1 %<>%
  left_join(california_temperature_spatial, by = c("NAME" = "COUNTY"))

counties_1 %<>% 
  drop_na("YEAR") %>% #drop any NA values
  filter(YEAR >= 2010 & YEAR <= 2019) 
#View(counties_1)

ggplot(counties_1)+
  geom_sf(aes(fill = `AVERAGE.TEMP..C.`))+
  scale_fill_gradient(high = "Red", low = "White")+
  facet_wrap(~YEAR, ncol = 5)+ 
  labs(title = "Spatial Plot of California counties representing Average Temperature over the years", fill = "Average Temperature (C)") +
  theme(axis.text = element_text(size = 6))

#creating different ggplots showcasing average temperature (C), separated by year
Figure 60: Spatial Plot of California counties representing Average Temperature over the years

While the spatial analysis of the suicide rates alone did not yield interesting results, when you consider the spatial analysis of suicide rates and average temperature, it seems interesting to note that the northern sides (which has a higher suicide rate) seems to have a lower temperature as compared to the southern side (which has a lower suicide rate).

6.1.2 Spatial Analysis: Change in Suicide Rates vs Change in Average Temperature

Change in Suicide Rates

ggplot(counties)+
  geom_sf(aes(fill = `CHANGE.IN.SUICIDE.RATE`))+
  scale_fill_gradient(high = "Blue", low = "White")+
  facet_wrap(~YEAR, ncol = 5)+ 
  labs(title = "Spatial Plot of California counties' Change in Suicide Rates over the years", fill = "Change in Suicide Rate (Per 100,000 People)")+
  theme(axis.text = element_text(size = 6))

Figure 61: Spatial Plot of California counties’ Change in Suicide Rates over the years

It is difficult to distinguish changes in suicide rates in different counties as they are look relatively the same (i.e. similar shades of purple).

Change in Average Temperature

ggplot(counties_1)+
  geom_sf(aes(fill = `CHANGE.IN.AVG.TEMP`))+
  scale_fill_gradient(high = "Red", low = "White")+
  facet_wrap(~YEAR, ncol = 5)+ 
  labs(title = "Spatial Plot of California counties' Average Temperature over the years", fill = "Average Temperature (C)")+
  theme(axis.text = element_text(size = 6))

Figure 62: Spatial Plot of California counties’ Average Temperature over the years

The changes in average temperature seems rather erratic and happens rather uniformly across counties.

We conclude that it is difficult to establish potential association between the changes in the two factors.

6.2 Sentiment Analysis - GuardianAPI

Recognizing the lack of data set from the Reddit API, we also explored using Guardian API to extract news articles, related to the two keywords “California” and “Heat”, from the Guardian. At the same time, we made sure to narrow down the section to “US News” given our interests in understanding sentiments towards the rising temperature among people in California. Through sentiment analysis, we hope to uncover clues pertaining to how rising temperatures can affect mental health, since emotions like fear, negative and joy are closely related to one’s mental health and wellbeing.

*Note: images of results are attached because there was difficulties in knittinf for the GuardianAPI section.
# gu_api_key()
# ca_temperature <- gu_content('"California" AND "heat"', from_date = "2013-01-01", sectionId = "us-news")
# write_csv(ca_temperature, file = "proj data/guardian_data_heat.csv")
# ca_temperature_us <- ca_temperature %>%
#   filter(section_name == "US news")
# tidy_cc <- ca_temperature_us %>%
#   unnest_tokens(word, body_text)
# 
# tidy_cc %<>%
#   anti_join(stop_words)
# 
# afinn <- tidy_cc %>%
#   inner_join(get_sentiments("afinn")) %>% 
#   mutate(web_publication_date_numeric = as.numeric(web_publication_date)) %>%
#   group_by(index = floor(web_publication_date_numeric / 604800)) %>%
#   summarise(sentiment = sum(value)) %>%
#   mutate(web_publication_date_interval = as.POSIXct(index * 604800, origin = "1970-01-01", tz = "UTC")) %>%
#   mutate(method = "AFINN")
# 
# ggplot(afinn, aes(index, sentiment, fill = method)) +
#   geom_col()
Sentiments in Guardian Posts using AFINN method
Sentiments in Guardian Posts using AFINN method
# bing <- tidy_cc %>% 
#   inner_join(get_sentiments("bing")) %>% 
#   mutate(method = "Bing et al.") %>%
#   mutate(web_publication_date_numeric = as.numeric(web_publication_date)) %>%
#   count(method, index = floor(web_publication_date_numeric %/% 604800), sentiment) %>% 
#   pivot_wider(names_from = sentiment,
#               values_from = n,
#               values_fill = 0) %>%
#   mutate(web_publication_date_interval = as.POSIXct(index * 604800, origin = "1970-01-01", tz = "UTC")) %>%
#   mutate(sentiment = positive - negative)
# 
# nrc <- tidy_cc %>% 
#   inner_join(get_sentiments("nrc") %>%
#                # only get positive and negative sentiments
#                filter(sentiment %in% c("positive", "negative"))) %>% 
#   mutate(method = "NRC") %>%
#   mutate(web_publication_date_numeric = as.numeric(web_publication_date)) %>%
#   count(method, index = floor(web_publication_date_numeric %/% 604800), sentiment) %>% 
#   pivot_wider(names_from = sentiment,
#               values_from = n,
#               values_fill = 0) %>%
#   mutate(web_publication_date_interval = as.POSIXct(index * 604800, origin = "1970-01-01", tz = "UTC")) %>%
#   mutate(sentiment = positive - negative)
# all_three <- bind_rows(afinn,
#                        bing, 
#                        nrc)
# 
# ggplot(all_three, aes(index, sentiment, fill = method)) +
#   geom_col(aes(fill = method))+
#   facet_wrap(~method, ncol = 1, scales = "free_y")

Sentiment Analysis across all three methods Based on the sentiment analysis with reference to the AFINN and Bing Lexicon, we noted the generally negative sentiment towards the heat in California. However, we were also surprised to see a seemingly parabolic trend in the graphs, which would be an area worth exploring in the future.

# tidy_body_text <- tidy_cc%>%
#   inner_join(get_sentiments("nrc"))
# 
# table(tidy_body_text$sentiment)
# 
# ggplot(tidy_body_text, aes(y = sentiment))+
#   geom_bar(aes(fill = sentiment))+
#   theme_minimal()+
#   labs(title = "Sentiments in Guardian Posts with Keyword 'Heat' and 'California', US News ")
Sentiments in Guardian Posts with Keyword ‘Heat’ and ‘California’, US News (using NRC)
Sentiments in Guardian Posts with Keyword ‘Heat’ and ‘California’, US News (using NRC)
# tidy_body_text_1 <- tidy_cc%>%
#   inner_join(get_sentiments("bing"))
# 
# table(tidy_body_text_1$sentiment)
# 
# ggplot(tidy_body_text_1, aes(y = sentiment))+
#   geom_bar(aes(fill = sentiment))+
#   theme_minimal()+
#   labs(title = "Sentiments in Guardian Posts with Keyword 'Heat' and 'California', US News ")
Sentiments in Guardian Posts with Keyword ‘Heat’ and ‘California’, US News (using Bing)
Sentiments in Guardian Posts with Keyword ‘Heat’ and ‘California’, US News (using Bing)
# tidy_body_text_2 <- tidy_cc%>%
#   inner_join(get_sentiments("afinn"))
# 
# table(tidy_body_text_2$sentiment)
# 
# ggplot(tidy_body_text_2, aes(y = value))+
#   geom_bar(aes(fill = as.factor(value)))+
#   theme_minimal()+
#   labs(title = "Sentiments in Guardian Posts with Keyword 'Heat' and 'California', US News ")
Sentiments in Guardian Posts with Keyword ‘Heat’ and ‘California’, US News (using AFINN)
Sentiments in Guardian Posts with Keyword ‘Heat’ and ‘California’, US News (using AFINN)

7. Project Contributions

Name Tasks
Ernest - Ideation of topic, writing of Final Report Proposal
- Conducted Literature Review and summarized key pointers from research papers
- Data Collection, Graph Plotting, Data Analysis and Coding (Trends, Sentiment Analysis, Annex)
- Presentation slides creation
- Report Writing (Literature Review)
Zhi Yan - Ideation of topic, writing of Final Report Proposal
- Data Analysis, Graph Plotting, Coding (Regression)
- Report Writing (Overall)
Zheng Yang - Ideation of topic, writing of Final Report Proposal
- Data Collection
- Report Writing (Literature Review)